In this exercise, I learnt how perform multivariate analysis by plotting a heatmap plot in R
packages = c('seriation', 'dendextend', 'heatmaply', 'tidyverse')
for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
wh <- read_csv("data/WHData-2018.csv")
row.names(wh) <- wh$Country
wh1 <- dplyr::select(wh, c(3, 7:12))
wh_matrix <- data.matrix(wh)
We can use the heatmap() function on the matrix to plot a heatmap.
The Rowv = NA and Colv = NA are used to switch off the dendograms which are on by default.
wh_heatmap <- heatmap(wh_matrix,
Rowv=NA, Colv=NA)
wh_heatmap <- heatmap(wh_matrix)
We can normalise it column-wise.
heatmaply(wh_matrix[, -c(1, 2, 4, 5)],
scale = "column")
We can normalise it as well.
heatmaply(normalize(wh_matrix[, -c(1, 2, 4, 5)]))
Similar to normalising, we can use the percentise method to divide the values by the maximum rank. Each value is the percent of observations that is above or below the value.
heatmaply(percentize(wh_matrix[, -c(1, 2, 4, 5)]))
We can use hierarchical clustering using the Euclidean distance and ward.D.
heatmaply(normalize(wh_matrix[, -c(1, 2, 4, 5)]),
dist_method = "euclidean",
hclust_method = "ward.D")
However, to determine the best clustering method, the dend_expend() and find_k() functions of the dendextend package will be used
dist_methods hclust_methods optim
1 unknown ward.D 0.6137851
2 unknown ward.D2 0.6289186
3 unknown single 0.4774362
4 unknown complete 0.6434009
5 unknown average 0.6701688
6 unknown mcquitty 0.5020102
7 unknown median 0.5901833
8 unknown centroid 0.6338734
From this, we can see that the average method should be used as it has the highest optimum value.
Then, we can use find_k() to find the optimal number of clusters.
With that, we can now prepare the heatmap like below.
heatmaply(normalize(wh_matrix[, -c(1, 2, 4, 5)]),
dist_method = "euclidean",
hclust_method = "average",
k_row = 3)
The clustering does not place the rows in a definite order. Hence, we need to use seriation package to optimise the distance between adjacent leaves (labels).
We can use the Optimal Leaf Ordering (OLO) by default to do this.
heatmaply(normalize(wh_matrix[, -c(1, 2, 4, 5)]),
seriate = "OLO")
We can also use the Gruvaeus and Wainer (GW) for a faster heuristic.
heatmaply(normalize(wh_matrix[, -c(1, 2, 4, 5)]),
seriate = "GW")
The option “mean” gives the output we would get by default from heatmap functions in other packages such as gplots::heatmap.2.
heatmaply(normalize(wh_matrix[, -c(1, 2, 4, 5)]),
seriate = "mean")
The option “none” gives us the dendrograms without any rotation that is based on the data matrix.
heatmaply(normalize(wh_matrix[, -c(1, 2, 4, 5)]),
seriate = "none")
We can apply a colour palette to improve the aesthetics of the heatmap.
heatmaply(normalize(wh_matrix[, -c(1, 2, 4, 5)]),
seriate = "none",
colors = Blues)
We can pass additional arguments to the heatmap to further refine the figure.
In the code chunk below the following arguments are used:
heatmaply(normalize(wh_matrix[, -c(1, 2, 4, 5)]),
Colv=NA,
seriate = "none",
colors = Blues,
k_row = 5,
margins = c(NA,200,60,NA),
fontsize_row = 4,
fontsize_col = 5,
main="World Happiness Score and Variables by Country, 2018 \nDataTransformation using Normalise Method",
xlab = "World Happiness Indicators",
ylab = "World Countries"
)